Causal conditioning and instantaneous coupling in causality graphs

Pierre-Olivier Amblard a,b, Olivier J.J. Michel a

a GIPSAlab/CNRS UMR 5216/ BP46, 38402 Saint Martin d'Hères cedex, Grenoble, France

b The University of Melbourne, Dept. of Math&Stat., Parkville, VIC, 3010, Australia

Abstract

The paper investigates the link between Granger causality graphs, recently
formalized by Eichler, and directed information theory, developed by Massey
and Kramer. We particularly insist on the implications of two notions of
causality that may occur in physical systems. It is well accepted that
dynamical causality is assessed by the conditional transfer entropy, a measure
appearing naturally as a part of directed information. Surprisingly, the notion
of instantaneous causality is often overlooked, even though it was clearly
understood in early works. In the bivariate case, instantaneous coupling is
measured adequately by the instantaneous information exchange, a measure that
supplements the transfer entropy in the decomposition of directed information.
In this paper, the focus is put on the multivariate case and conditional
graph modeling issues. In this framework, we show that the decomposition
of directed information into the sum of transfer entropy and information
exchange does not hold anymore. Nevertheless, the discussion allows us to put
forward the two measures as pillars for the inference of causality graphs. We
illustrate this on two synthetic examples, which allow us to discuss not only
the theoretical concepts but also the practical estimation issues.
1. Introduction

1.1. Motivations 

Graphical modeling has received major attention in many different domains
such as neuroscience [18], econometrics [8], and complex networks [25]. It
proposes a representation paradigm for explaining how information flows
between the nodes of a graph. The graph vertices are in most cases, and in
particular in this paper, associated with synchronous time series. Inferring a
graph thus requires defining edges or links between the vertices. Granger
[16, 17] proposed a set of axiomatic definitions for the causality between, say,
x and y (with a slight abuse of notation, each vertex is named after its
associated time series). Granger's definitions are based on the improvement
that observations of x up to some time t - 1 may provide for predicting y
at t. The fundamental idea in Granger's approach is that past and present
may cause the future but the future cannot cause the past [16, Axiom A].
Granger's work also stresses the importance of side information, accounting
for the presence of all vertices other than x and y, for assessing the existence
of a link between two nodes. This leads to what will be referred to as the
bivariate case (absence of side information) or the multivariate case (presence
of side information) in the sequel. The use of Granger causality in the latter
case is mainly due to Eichler and Dahlhaus [7, 8, 9].

In [9], precise definitions of Granger causality graphs are presented, and
the two notions of dynamical causality and instantaneous causality (as we
call them in this paper) are put forward. Note that the notion of
instantaneous causality was present in the early works on Granger causality,
but this notion, which seems quite weak compared to the other, has been
overlooked in modern studies on causality, especially in applications.
Instantaneous dependence in complex networks may arise from different origins.
Even if instantaneous information exchange between nodes is hard to conceive
physically, the recording process (including filters, sample-and-hold
devices, converters) contains integrators over short time lags. Any information
flowing between two nodes within a delay shorter than the integration time
may then be seen as instantaneous. Such a case is often met in systems
requiring long integration times per sample, as for example in fMRI.
Alternatively, instantaneous coupling may occur if the noise contributions in
structural models are no longer independent.



1.2. Aim of the paper and outline 

The purpose of this paper is to provide new insight into the problems
related to instantaneous coupling, and to show how the presence of such
coupling may affect the estimated structure of a graphical model that should
provide a sparse representation of a complex system. The paper is focused
on the interplay of two types of causality: dynamical causality and
instantaneous causality, overlooked in all other works on directed information.
The possibility of estimating directed information based measures with
k-nearest neighbors based tools is illustrated.

We begin the paper with a short review of possible approaches to Granger
causality. Section 3 then gives a brief review and some definitions of
Eichler and Dahlhaus causality graphs [7, 9] and presents an enlightening toy
problem, where instantaneous coupling strongly affects the edge detection in
a graphical model. Theoretical relations exhibiting the link between directed
information theory and Granger causality graphs are developed in the
following section. The last section discusses some practical implementation
issues and gives a full treatment of the toy problem studied previously.

2. Approaches to Granger causality 

2.1. Model based approaches 

In Geweke's pioneering work [11, 12], an autoregressive modeling approach
(for the bivariate as well as for the multivariate case) was adopted in order
to provide a practical implementation of Granger causality graphs. Such
a model based approach motivated further studies. Information theoretic
tools were added by Rissanen and Wax [29] in order to account for the
complexity of the regression model. Directed transfer functions, which are
frequency domain filter models for Granger causality, were derived in [TBI E]
for neuroscience applications. Nonlinear extensions have been proposed [14],
with recent developments relying on functional estimation in RKHS [22, 3]. All
these approaches are intrinsically parametric and, as such, may introduce
some bias in the analysis.

2.2. Information theoretic based measures 

An alternative for assessing the existence of a link between nodes was
elaborated early on in the bivariate case (see for example a sample from the
literature [30, 15, 31, 28, 32]). It consists in adapting information theoretic
measures such as mutual information or information divergences to assess the
existence and/or strength of a link between two nodes. The motivation for
introducing such tools is the ability of information theoretic measures to
account for the entire probability density function of the observations
(provided that such a density exists), instead of only second order
characteristics as in linear filter modeling approaches. Among these
references, one of the oldest and maybe the least known was developed by
Gourieroux et al. [15], where a generalization of Geweke's idea [11] using
Kullback divergences is introduced. It is noteworthy that the tools they
introduced were later rediscovered by Massey and Kramer in their development
of bivariate directed information theory.

The development of directionality or causality specific measures was
initiated by Marko's work on directed information [23], and extended by Massey
[24] and later Kramer [19], who introduced causal conditioning by side
information. This offers a means to account for side information, or to tackle
the multivariate case. First steps in exploring the relation between Geweke's
approach to Granger causality and directed information theoretic tools were
made in [1] for the Gaussian case, and further insights are developed in [2],
or in [26] in the absence of instantaneous dependence structure. In [2], a
new definition of Granger causality based on directed information is proposed.
Eichler's recent paper [9] studies this latter issue in a graph modeling
framework, either from a theoretical point of view resorting to probability
based definitions, or in a parametric modeling context.

3. Causality graphs 

We briefly review the notion of causality graph as developed by Eichler.
The main reference is [9], where a complete presentation of causality graphs
as well as a study of their Markovian properties are developed.

3.1. Definitions 

Let $x_V = \{x_V(k), k \in \mathbb{Z}\}$ be a d-dimensional discrete time stationary
multivariate process on some probability space. The probability measures are
assumed to be absolutely continuous with respect to Lebesgue measure, and
the associated densities are denoted P. V is the index set $\{1, \ldots, d\}$.
For $a \in V$ we denote by $x_a$ the corresponding component of $x_V$. Likewise,
for any subset $A \subset V$, $x_A$ is the corresponding multivariate process. The
information obtained by observing $x_A$ up to time k is summarized by the
filtration generated by $\{x_A(l), \forall l \le k\}$. It is denoted as $x_A^k$.

Following [16, 11, 9], three definitions may be proposed for Granger
causality. The first one is based on simple forward prediction, the root
concept underlying Granger causality. The next two definitions correspond to
alternative choices in defining instantaneous causality. Let A and B be two
disjoint subsets of V. Let $C = V \setminus (A \cup B)$.

Definition 1 (Dynamical). $x_A$ does not (dynamically) cause $x_B$ if for all
$k \in \mathbb{Z}$,

$$P(x_B(k+1) \mid x_A^k, x_B^k, x_C^k) = P(x_B(k+1) \mid x_B^k, x_C^k)$$



Dynamical Granger causality states that x causes y if the prediction of y from
its past is improved when also considering the past of x. Moreover, this is
relative to any side information observed prior to the prediction. This is the
meaning of definition 1: conditionally on its past and on the side information,
$x_B$ is independent of the past of $x_A$. In mathematical terms,
$x_A^k \to x_B^k \to x_B(k+1)$ is a Markov chain conditionally on the side
information $x_C^k$.

Conditioning on $x_C^k$ instead of $x_C^{k+1}$ in def. 1 raises an important issue. In
a model estimation framework not aimed at identifying links between possibly
all pairs of nodes, one may think about accounting for the present of $x_C$ in
the prediction problem; this is for instance the case for ARMA modeling.
However, conditioning on $x_C^{k+1}$ weakens the effectiveness of the definition of
causality by introducing a symmetry in the causal relationship between B
and C. Conditioning is therefore restricted to the past of the observation,
in a strict sense. This excludes the possibility of instantaneous dependences,
for which a separate definition is required. There are, however, two possible
definitions.

Definition 2 (Instantaneous). $x_A$ does not (instantaneously) cause $x_B$ if
for all $k \in \mathbb{Z}$,

$$P(x_B(k+1) \mid x_A^{k+1}, x_B^k, x_C^{k+1}) = P(x_B(k+1) \mid x_A^k, x_B^k, x_C^{k+1})$$

The second possibility is the following.

Definition 3 (Unconditional instantaneous). $x_A$ does not (unconditionally
instantaneously) cause $x_B$ if for all $k \in \mathbb{Z}$,

$$P(x_B(k+1) \mid x_A^{k+1}, x_B^k, x_C^k) = P(x_B(k+1) \mid x_A^k, x_B^k, x_C^k)$$

Firstly, definitions 2 and 3 are easily shown to be symmetrical in A and B
(application of Bayes theorem). Secondly, taking as side information $x_C^{k+1}$ in
def. 2 instead of $x_C^k$ in def. 3 is fundamental here. If the side information is
considered up to time k only, the instantaneous dependence or independence
is not conditional on the remaining nodes in C. In fact, including all the
information up to time k in the conditioning variables allows one to test the
instantaneous dependence or independence between $x_A(k+1)$ and $x_B(k+1)$.
The independence tested is not conditional if $x_C(k+1)$ is not included in the
conditioning set, whereas it is conditional if $x_C(k+1)$ is included. The choice
is thus crucial for the type of graph of instantaneous dependence obtained.
With definition 2, the graphs obtained are conditional dependence graphs, as
usual in graphical modeling [36, 21]. On the contrary, graphs obtained with
definition 3 are dependence graphs, which do not have the nice Markov
properties that conditional dependence graphs may have.

The two possible types of causality (dynamical or instantaneous) will 
be encoded on the graphs by two different types of edges between vertices. 
Dynamical causality will be represented by an arrow, hence symbolizing di- 
rectivity, whereas instantaneous causality will be represented by a line. 

3.2. A Detailed example 

For the sake of illustration we consider a four dimensional simple example. 

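Let $\rho_1, \rho_2, \rho_3 \in (-1, 1)$ and let

$$\Gamma_\varepsilon = \begin{pmatrix} 1 & \rho_1 & 0 & \rho_1\rho_2 \\ \rho_1 & 1 & 0 & \rho_2 \\ 0 & 0 & 1 & \rho_3 \\ \rho_1\rho_2 & \rho_2 & \rho_3 & 1 \end{pmatrix} \qquad (1)$$

be the covariance matrix of the i.i.d. zero mean Gaussian sequence
$(\varepsilon_{w,t}, \varepsilon_{x,t}, \varepsilon_{y,t}, \varepsilon_{z,t})^T$. The inverse of $\Gamma_\varepsilon$, known as the precision matrix,
reveals the conditional independence relationships between the components of
the noise (since it is Gaussian), and reads

$$\Gamma_\varepsilon^{-1} = \begin{pmatrix} d_1 & -d_1\rho_1 & 0 & 0 \\ -d_1\rho_1 & d_1 d_2 (1 - \rho_1^2\rho_2^2 - \rho_3^2) & d_2\rho_2\rho_3 & -d_2\rho_2 \\ 0 & d_2\rho_2\rho_3 & d_2(1 - \rho_2^2) & -d_2\rho_3 \\ 0 & -d_2\rho_2 & -d_2\rho_3 & d_2 \end{pmatrix} \qquad (2)$$

where $d_1 = 1/(1 - \rho_1^2)$ and $d_2 = 1/(1 - \rho_2^2 - \rho_3^2)$. Consider the following structural
model

$$\begin{aligned}
w_t &= f_w(w_{t-1}, x_{t-1}, z_{t-1}) + \varepsilon_{w,t} \\
x_t &= f_x(x_{t-1}, z_{t-1}) + \varepsilon_{x,t} \\
y_t &= f_y(x_{t-1}, y_{t-1}) + \varepsilon_{y,t} \\
z_t &= f_z(w_{t-1}, z_{t-1}) + \varepsilon_{z,t}
\end{aligned}$$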

To infer the causality graph, we first look for directed links between pairs of
nodes. In such a structural model, if a signal $\alpha$ at time t depends through
the function $f_\alpha$ on another signal $\beta$ at time t - 1, then there is a link $\beta \to \alpha$.
For example, consider the question of whether there is a link from z to w.
We have from the definition of the model

$$P(w_t \mid w^{t-1}, z^{t-1}, (x,y)^{t-1}) = P_{\varepsilon_w}(w_t - f_w(w_{t-1}, x_{t-1}, z_{t-1}))$$

$$P(w_t \mid w^{t-1}, (x,y)^{t-1}) = E_{z_{t-1}}[P_{\varepsilon_w}(w_t - f_w(w_{t-1}, x_{t-1}, z_{t-1}))]$$

which are obviously not equal here. Therefore $z \to w \mid x, y$. Consider
now the case of z and y. We have $P(y_t \mid y^{t-1}, z^{t-1}, (x,w)^{t-1}) = P_{\varepsilon_y}(y_t - f_y(x_{t-1}, y_{t-1})) = P(y_t \mid y^{t-1}, (x,w)^{t-1})$.
Thus $z \not\to y \mid x, w$. Doing this pairwise, or following the intuitive point
of view described above, leads to the set of oriented edges depicted in the
causality graph in figure 1. To get the instantaneous edges, as discussed in
the previous section, we have two possible definitions. If side information is
considered up to time t - 1, we obtain the unconditional graph in figure 1.
Indeed, for the unconditional graph, testing for the presence of an edge
between x and y, we evaluate
$P(x_t \mid x^{t-1}, y^t, (w,z)^{t-1}) = P(\varepsilon_x \mid \varepsilon_y) = P(\varepsilon_x)$, since $\varepsilon_x$ and $\varepsilon_y$ are independent
(examine $\Gamma_\varepsilon$ and remember that the noises are Gaussian). Note that doing this
for all pairs, we really obtain the graph of dependence relationships. For the
conditional graph, we instead evaluate
$P(x_t \mid x^{t-1}, y^t, (w,z)^t) = P(\varepsilon_x \mid \varepsilon_y, \varepsilon_w, \varepsilon_z)$.
In this case, we really measure the conditional dependence between x and y.
It turns out in the example that, even if independent, $\varepsilon_x$ and $\varepsilon_y$ are dependent
conditionally on $\varepsilon_z$, and therefore there is an undirected edge between x and
y in the conditional graph.
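
The role of the precision matrix in this example can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes the ordering (w, x, y, z) and a noise covariance chosen so that its inverse matches the closed-form precision matrix quoted above; the function names are hypothetical, and the correlation values are those used later in the paper's simulations.

```python
import numpy as np

def noise_covariance(r1, r2, r3):
    # Assumed covariance of (eps_w, eps_x, eps_y, eps_z): x and y are
    # uncorrelated, both are correlated with z, and w is correlated with x.
    return np.array([
        [1.0,     r1,  0.0, r1 * r2],
        [r1,      1.0, 0.0, r2],
        [0.0,     0.0, 1.0, r3],
        [r1 * r2, r2,  r3,  1.0],
    ])

def closed_form_precision(r1, r2, r3):
    # Closed-form inverse, with d1 and d2 as defined in the text.
    d1 = 1.0 / (1.0 - r1**2)
    d2 = 1.0 / (1.0 - r2**2 - r3**2)
    return np.array([
        [d1,       -d1 * r1,                              0.0,              0.0],
        [-d1 * r1, d1 * d2 * (1 - r1**2 * r2**2 - r3**2), d2 * r2 * r3,     -d2 * r2],
        [0.0,      d2 * r2 * r3,                          d2 * (1 - r2**2), -d2 * r3],
        [0.0,      -d2 * r2,                              -d2 * r3,         d2],
    ])

r1, r2, r3 = 0.66, 0.55, 0.48
gamma = noise_covariance(r1, r2, r3)
precision = np.linalg.inv(gamma)

# Zero covariance between x and y (marginal independence) ...
print("cov(eps_x, eps_y)       =", gamma[1, 2])
# ... but a nonzero precision entry (dependence given the other components).
print("precision(eps_x, eps_y) =", precision[1, 2])
```

A zero entry in the covariance indicates marginal independence of two Gaussian components, whereas a zero entry in the precision matrix indicates independence conditionally on all the others; here the (x, y) entry vanishes in the former but not in the latter.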



4. Directed information and causality graphs 

We start with a brief reminder on the main definitions of directed infor- 
mation and some related results. Bivariate analysis results are sketched, to 
provide better insight in discussing the multivariate case. 

Massey's work focuses on information measures for systems that may
exhibit feedback [21]. In this framework, Massey proved that the appropriate
information measure was no longer the mutual information but the directed
information. For two subsets A and B, directed information is defined by

$$I(x_A^k \to x_B^k) = \sum_{i=1}^{k} I(x_A^i; x_B(i) \mid x_B^{i-1})$$

where $I(x_A^i; x_B(i) \mid x_B^{i-1})$ stands for the usual conditional mutual information [6].

Later, in [19], Kramer introduced the idea of causal conditioning and defined
the causally conditioned entropy as a modified version of the Bayes chain
rule for conditional entropy. While the usual chain rule writes
$H(x_B^k \mid x_A^k) = \sum_{i=1}^{k} H(x_B(i) \mid x_B^{i-1}, x_A^k)$, causally conditioned entropy is defined as

$$H(x_B^k \| x_A^k) = \sum_{i=1}^{k} H(x_B(i) \mid x_B^{i-1}, x_A^i).$$

The difference lies in the conditioning on $x_A$, which is now considered up to
time i only for each term entering the sum. From the definitions above, the
directed information is easily decomposed into the difference of two terms

$$I(x_A^k \to x_B^k) = H(x_B^k) - H(x_B^k \| x_A^k)$$

which can be compared to the well known (sometimes taken as a definition)
formula for the mutual information $I(x_A^k; x_B^k) = H(x_B^k) - H(x_B^k \mid x_A^k)$.
Assuming the presence of side information, causal conditioning of directed
information is thus defined by substituting causally conditioned entropies
for entropies in the definition of directed information. Causally conditioned
directed information is given by

$$I(x_A^k \to x_B^k \| x_C^k) = H(x_B^k \| x_C^k) - H(x_B^k \| x_A^k, x_C^k) = \sum_{i=1}^{k} I(x_A^i; x_B(i) \mid x_B^{i-1}, x_C^i). \qquad (3)$$



From these definitions, Massey and Kramer derived two interesting
results. The first one is the following equality, where $Dx_A^k = (0, x_A^{k-1})$
represents the delayed (one time lag) version of $x_A^k$:

$$I(x_A^k \to x_B^k) + I(x_B^k \to x_A^k) = I(x_A^k; x_B^k) + I(x_A^k \to x_B^k \| Dx_A^k) \qquad (4)$$

This implies that the sum of the directed informations is at least as large as
the mutual information. The term $I(x_A^k \to x_B^k \| Dx_A^k)$ is nonnegative (sum of
nonnegative contributions) and accounts for the instantaneous information
exchange. Using equation (3), one easily gets

$$I(x_A^k \to x_B^k \| Dx_A^k) = \sum_{i=1}^{k} I(x_A(i); x_B(i) \mid x_B^{i-1}, x_A^{i-1}) \qquad (5)$$

which is symmetric with respect to A and B.

It is noteworthy that, by its definition, directed information accounts for
instantaneous information exchange as well as for dynamical information
exchange. Then, in the sum of the directed informations in the l.h.s. of
equation (4), the contribution of instantaneous information is counted twice.
It is counted only once in the mutual information, and this explains the
remaining term in the r.h.s. of the equation.

The instantaneous information exchange term $I(x_A^k \to x_B^k \| Dx_A^k)$ is zero if
and only if $I(x_A(i); x_B(i) \mid x_B^{i-1}, x_A^{i-1}) = 0, \forall i$, i.e. $x_A(i)$ and $x_B(i)$ are independent
conditionally on their past. Such a situation may occur for multivariate
Markov processes described by $x_V(t) = f(x_V(t-1)) + \varepsilon_V(t)$, where $\varepsilon_V(t)$
is an i.i.d. multivariate noise process with independent components. Note
that in the example of the preceding section, $\Gamma_\varepsilon$ is not diagonal; the
noise components are therefore correlated and lead to some instantaneous
information exchange between some nodes (in a nontrivial way).

We are at this point ready to examine how directed information may be
used in causality graphs. Faced with multivariate measurements, two
approaches are possible. The first one is a bivariate analysis in which we study
directed information between pairs of nodes, forgetting the side information
(remaining nodes). The second one accounts for side information but
needs some further developments. Even if the bivariate framework is a naive
approach, it is presented since it gives some insight into how directed
information is applied. We then turn to the trickier multivariate analysis.
4.1. Bivariate analysis in graphs

Consider two disjoint subsets A and B of V. The directed information
may be re-expressed as the sum

$$I(x_A^k \to x_B^k) = I(x_A^k \to x_B^k \| Dx_A^k) + I(Dx_A^k \to x_B^k)$$

where the first term is the instantaneous information exchange defined by
equation (5), whereas the second, $I(Dx_A^k \to x_B^k)$, will be referred to as the
transfer entropy, following Schreiber's definition proposed in a different
framework [31]. In the absence of any side information, these terms account for
the instantaneous causality and for the dynamical causality respectively. Indeed,
the transfer entropy reduces to zero if and only if $I(x_A^{i-1}; x_B(i) \mid x_B^{i-1}) = 0, \forall i$,
or equivalently if and only if $x_A$ does not dynamically cause $x_B$ (see def. 1).
Furthermore, we have seen above that $I(x_A^k \to x_B^k \| Dx_A^k) = 0$ if and only if
$x_A$ and $x_B$ are independent conditionally on their past, or in the words of
our definitions, if and only if $x_A$ does not instantaneously cause $x_B$. This
result extends those obtained in the Gaussian bivariate case in pQ and in [5] 
restricted to the dynamical causality. Again, these conclusions hold in the 
sole case where no side information is considered. 
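
For a Gaussian first-order Markov pair, both terms of this decomposition can be evaluated in closed form from covariances, which gives a simple check of the statements above. The sketch below is an illustration under stated assumptions, not the authors' procedure: the bivariate VAR(1) coefficients and the noise correlation are arbitrary choices, the helper names are hypothetical, and the one-step quantities $I(x_{t-1}; y_t \mid y_{t-1})$ and $I(x_t; y_t \mid x_{t-1}, y_{t-1})$ are used as the per-sample transfer entropy and instantaneous exchange.

```python
import numpy as np

def gaussian_cmi(cov, a, b, c):
    """I(a; b | c) in nats for a Gaussian vector with covariance `cov`.
    a, b, c are lists of indices into the vector; c may be empty."""
    def logdet(idx):
        if not idx:
            return 0.0
        return np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    # Four-entropy formula; the Gaussian constants cancel out.
    return 0.5 * (logdet(a + c) + logdet(b + c) - logdet(a + b + c) - logdet(c))

def stationary_cov(A, Q):
    # Solve S = A S A^T + Q through vec(S) = (I - A kron A)^{-1} vec(Q).
    n = A.shape[0]
    vec_s = np.linalg.solve(np.eye(n * n) - np.kron(A, A), Q.reshape(-1))
    return vec_s.reshape(n, n)

# Assumed model: v_t = A v_{t-1} + e_t with v = (x, y); x drives y, not the
# reverse, and the two noise components have correlation rho.
rho = 0.6
A = np.array([[0.5, 0.0],
              [0.4, 0.5]])
Q = np.array([[1.0, rho],
              [rho, 1.0]])

S = stationary_cov(A, Q)
# Joint covariance of (x_t, y_t, x_{t-1}, y_{t-1}).
J = np.block([[S, A @ S],
              [S @ A.T, S]])

te_x_to_y = gaussian_cmi(J, [2], [1], [3])        # I(x_{t-1}; y_t | y_{t-1})
te_y_to_x = gaussian_cmi(J, [3], [0], [2])        # I(y_{t-1}; x_t | x_{t-1})
instantaneous = gaussian_cmi(J, [0], [1], [2, 3]) # I(x_t; y_t | past)

print(te_x_to_y, te_y_to_x, instantaneous)
```

Under these assumptions the transfer entropy from y to x vanishes, the one from x to y does not, and the instantaneous term only depends on the noise correlation, being equal to $-\frac{1}{2}\log(1-\rho^2)$.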

4.2. Multivariate analysis in graphs

It is assumed in the sequel that the set of measurements or nodes V is
partitioned into three disjoint subsets A, B and $C = V \setminus (A \cup B)$. We study
information flow between A and B when side information C is considered.
Mathematically, taking into account side information corresponds to
conditioning on it. As outlined earlier, since we deal with
the graph V, we must use causal conditioning, or we would break the
symmetry between either A or B and C. This leads us to relate causally conditioned
directed information to definitions 1, 2 and 3. Since we have two possible
definitions for instantaneous causality, we have two possible choices for using
the side information as a conditioner: we may use the past $x_C^{k-1} = Dx_C^k$, or
the past as well as the present $x_C^k$.

Conditioning on the past: We evaluate $I(x_A^k \to x_B^k \| Dx_C^k)$. This can
be written as

$$I(x_A^k \to x_B^k \| Dx_C^k) = I(Dx_A^k \to x_B^k \| Dx_C^k) + I(x_A^k \to x_B^k \| Dx_A^k, Dx_C^k)$$

We call the first term of this decomposition, $I(Dx_A^k \to x_B^k \| Dx_C^k)$, the
conditional transfer entropy between A and B given C. It is zero if
and only if $I(x_A^{i-1}; x_B(i) \mid x_B^{i-1}, x_C^{i-1}) = 0, \forall i$, or equivalently if and only if
$P(x_B(k) \mid x_A^{k-1}, x_B^{k-1}, x_C^{k-1}) = P(x_B(k) \mid x_B^{k-1}, x_C^{k-1})$. In other words, according
to definition 1, the conditional transfer entropy between A and B is zero if
and only if A does not dynamically cause B.

The second term, $I(x_A^k \to x_B^k \| Dx_A^k, Dx_C^k)$, is zero if and only if
$I(x_A(i); x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}) = 0, \forall i$, and therefore, according to definition
3, if and only if A does not unconditionally instantaneously cause B. We
will refer to this measure as the unconditional instantaneous information
exchange. Note that the term "unconditional" refers to the type
of independence the measure reveals.

Conditioning up to the present: We evaluate $I(x_A^k \to x_B^k \| x_C^k)$. The
idea is to find a decomposition in which both the conditional transfer entropy
and a measure accounting for definition 2 appear. Applying the chain rule
for conditional mutual information several times, the defining term of the
causally conditioned directed information $I(x_A^k \to x_B^k \| x_C^k)$ satisfies

$$\begin{aligned}
I(x_A^i; x_B(i) \mid x_B^{i-1}, x_C^i) &= I(x_C(i), x_A^i; x_B(i) \mid x_B^{i-1}, x_C^{i-1}) - I(x_C(i); x_B(i) \mid x_B^{i-1}, x_C^{i-1}) \\
&= I(x_A^{i-1}; x_B(i) \mid x_B^{i-1}, x_C^{i-1}) + I(x_A(i); x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^i) \\
&\quad + I(x_C(i); x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}) - I(x_C(i); x_B(i) \mid x_B^{i-1}, x_C^{i-1})
\end{aligned}$$

Summing over i we get the causally conditioned directed information

$$I(x_A^k \to x_B^k \| x_C^k) = I(Dx_A^k \to x_B^k \| Dx_C^k) + I(x_A^k \to x_B^k \| Dx_A^k, x_C^k) + \Delta I(x_C^k \to x_B^k \| Dx_A^k, Dx_C^k) \qquad (6)$$

The term $I(x_A^k \to x_B^k \| Dx_A^k, x_C^k)$ is called the conditional instantaneous
information exchange. It is equal to zero if and only if definition 2 is verified,
that is, if and only if A does not instantaneously cause B. We recover in
the decomposition the conditional transfer entropy accounting for dynamical
causality. The surprise arises from an extra term in eq. (6), defined as

$$\Delta I(x_C^k \to x_B^k \| Dx_A^k, Dx_C^k) = I(x_C^k \to x_B^k \| Dx_A^k, Dx_C^k) - I(x_C^k \to x_B^k \| Dx_C^k)$$

This term also measures an instantaneous quantity. It is the difference
between two different kinds of instantaneous coupling. The first term,
$I(x_C(i); x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1})$, describes intrinsic coupling in the sense that it does
not depend on parties other than C and B. The second coupling term,
$I(x_C(i); x_B(i) \mid x_B^{i-1}, x_C^{i-1})$, is relative to extrinsic coupling since it
measures the instantaneous coupling at time i created by variables other than
B and C.

The conclusion is the following: causal directed information is the right
measure to assess information flow in Granger causality graphs if the
unconditional definition is adopted for instantaneous causality. In this case,
causal directed information $I(x_A^k \to x_B^k \| Dx_C^k)$ is zero if and only if there
is no causality from A to B. If not zero, we must evaluate the conditional
transfer entropy and the unconditional instantaneous information exchange
to assess dynamical and instantaneous causality. However, as shown by Eich- 
ler [HI E] , the graphs obtained in this case do not have nice properties since 
the instantaneous graph is not a conditional dependence graph. 

On the other hand, if we adopt definition 2 for instantaneous causality,
we do not have the same nice decomposition, and $I(x_A^k \to x_B^k \| x_C^k)$ cannot
be used to check noncausality. However, we have shown that the correct
measures to assess dynamical and instantaneous causality are respectively
the conditional transfer entropy $I(Dx_A^k \to x_B^k \| Dx_C^k)$ and the instantaneous
information exchange $I(x_A^k \to x_B^k \| Dx_A^k, x_C^k)$.

5. Illustrations 

This section is devoted to the practical application of the previous results. 
We begin by discussing estimation issues and illustrate the key ideas on 
synthetic examples. 

5.1. Estimation issues 

The estimators we use are based on Leonenko's k-nearest neighbour
estimator of the entropy. Let $x_i$, $i = 1, \ldots, N$, be N observations of some random
vector x taking values in $\mathbb{R}^n$. Then Leonenko's estimator for the entropy
reads [13]

$$\hat{H}_k(x) = \frac{1}{N} \sum_{i=1}^{N} \log\left( (N-1)\, C_k\, V_n\, d(x_i, x_{i(k)})^n \right)$$

In this expression, $d : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^+$ is a metric, and $x_{i(k)}$ is defined to be the
k-th nearest neighbor of $x_i$. $V_n$ is the volume of the unit ball for the metric d;
$C_k = \exp(-\psi(k))$, where $\psi(\cdot)$ is the digamma function, defined as the derivative of
the logarithm of the Gamma function. It is shown in [13] that this estimator
converges in the mean square sense to the entropy of the random vector x
(under the i.i.d. assumption on the $x_i$) for any value of k lower than N - 1.
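
The estimator above translates almost literally into code. The following is a minimal sketch under a few assumptions that are ours and not the paper's: the Euclidean metric is used (so that $V_n$ is the volume of the Euclidean unit ball), distances are computed by brute force, which limits it to a few thousand samples, and the function names are hypothetical.

```python
import math
import numpy as np

def digamma_int(k):
    # psi(k) for a positive integer k: -gamma + sum_{j=1}^{k-1} 1/j
    return -np.euler_gamma + sum(1.0 / j for j in range(1, k))

def knn_entropy(samples, k=4):
    """k-nearest-neighbour entropy estimate (in nats), Euclidean metric."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n_obs, dim = x.shape
    # Brute-force pairwise distances, O(N^2) in memory.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sorted position 0 is the zero self-distance, position k the k-th neighbour.
    kth = np.partition(dist, k, axis=1)[:, k]
    # log-volume of the Euclidean unit ball in dimension `dim`
    log_unit_ball = (dim / 2) * math.log(math.pi) - math.lgamma(dim / 2 + 1)
    # mean of log((N - 1) * C_k * V_n * d^n), with C_k = exp(-psi(k))
    return (math.log(n_obs - 1) - digamma_int(k) + log_unit_ball
            + dim * np.mean(np.log(kth)))

rng = np.random.default_rng(0)
gauss = rng.standard_normal(1500)
print("estimate:", knn_entropy(gauss, k=4))
print("theory  :", 0.5 * math.log(2 * math.pi * math.e))
```

For larger data sets, a tree-based neighbour search would replace the brute-force distance matrix; this changes the cost but not the estimate.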

Estimation of the conditional mutual information. To estimate
a directed information, we need to estimate conditional mutual informations
$I(a; b \mid c) = H(b, c) + H(a, c) - H(a, b, c) - H(c)$. Thus estimating the
conditional mutual information can be done using four applications of Leonenko's
estimator. Although it is asymptotically unbiased, Leonenko's estimator is
biased for finite sample size, and the bias depends on the dimension of the
underlying space. Apart from the fact that a plug-in estimator would suffer
from a high variance, the biases of the entropy estimators will therefore not
cancel out.

A smart idea to circumvent this problem was proposed by Kraskov in 2004
for the mutual information case, and later extended by Frenzel and Pompe
to the conditional mutual information case [20, 10]. The idea relies on two
facts: the estimator is valid for any metric, and the estimator converges for
any k < N - 1. The idea is then to use as a metric in the product space
the maximum of the metrics used on the marginal spaces. This determines,
as a scale in the product space, the distance $d(x_i, x_{i(k)})$ between $x_i$ and its k-th
nearest neighbor. This distance is then used on the marginals to determine
the value k' for which $d(x_i, x_{i(k)})$ is the distance between $x_i$ projected on the
marginal and its k'-th nearest neighbour.
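
A common way of writing the resulting conditional mutual information estimator is $\hat{I}(a; b \mid c) = \psi(k) + \langle \psi(n_c + 1) - \psi(n_{ac} + 1) - \psi(n_{bc} + 1) \rangle$, where $n_{ac}$, $n_{bc}$ and $n_c$ count the points falling strictly within the joint-space scale in each marginal space. The sketch below implements this form under our own assumptions (max-norm in every space, brute-force distances, hypothetical function names); it is meant as an illustration of the principle rather than as the exact implementation used in the paper.

```python
import numpy as np

def _maxnorm_dist(v):
    # Pairwise max-norm distances between the rows of v.
    v = np.asarray(v, dtype=float)
    if v.ndim == 1:
        v = v[:, None]
    return np.abs(v[:, None, :] - v[None, :, :]).max(axis=-1)

def knn_cmi(a, b, c, k=5):
    """k-NN estimate of I(a; b | c) in nats (Frenzel-Pompe style)."""
    da, db, dc = _maxnorm_dist(a), _maxnorm_dist(b), _maxnorm_dist(c)
    n_obs = da.shape[0]
    # Product-space metric: maximum of the marginal metrics.
    joint = np.maximum(np.maximum(da, db), dc)
    # Distance to the k-th neighbour in the joint space (position 0 is self).
    eps = np.partition(joint, k, axis=1)[:, k][:, None]
    # Points strictly closer than eps in each marginal space; the point
    # itself is included in the count, which gives the "n + 1" terms.
    n_ac = (np.maximum(da, dc) < eps).sum(axis=1)
    n_bc = (np.maximum(db, dc) < eps).sum(axis=1)
    n_c = (dc < eps).sum(axis=1)
    # Digamma at positive integers: psi(m) = -gamma + H_{m-1}.
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n_obs + 1))))
    psi = lambda m: -np.euler_gamma + harmonic[m - 1]
    return psi(k) + np.mean(psi(n_c) - psi(n_ac) - psi(n_bc))

rng = np.random.default_rng(1)
n = 1000
# Chain x -> z -> y: x and y are independent given z.
x = rng.standard_normal(n)
z = x + 0.5 * rng.standard_normal(n)
y = z + 0.5 * rng.standard_normal(n)
# Common effect u -> s <- v: u and v become dependent given s.
u, v = rng.standard_normal(n), rng.standard_normal(n)
s = u + v + 0.5 * rng.standard_normal(n)

print("I(x; y | z) ~", knn_cmi(x, y, z))
print("I(u; v | s) ~", knn_cmi(u, v, s))
```

The second case mirrors the noise structure of the detailed example of section 3.2, where two independent components become dependent once a third one is conditioned on.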

Estimation of the directed information. Estimation requires that
the processes studied are ergodic and stationary. Without these basic
assumptions, nothing can really be done. The goal is to estimate the transfer
entropy and the instantaneous information exchange. When dealing with
scalar signals $x_A(k) = x(k)$ and $x_B(k) = y(k)$, and with side
information $x_C(k)$, the information measures read

$$I(Dx^k \to y^k \| Dx_C^k) = \sum_{i=1}^{k} I(x^{i-1}; y(i) \mid y^{i-1}, x_C^{i-1})$$

$$I(x^k \to y^k \| Dx^k, x_C^k) = \sum_{i=1}^{k} I(x(i); y(i) \mid y^{i-1}, x^{i-1}, x_C^i)$$



For stationary sequences, it is convenient to consider the rate of growth
of these measures. Indeed, the measures often increase linearly with k.
The rate is thus defined as the asymptotic linear growth rate. Furthermore,
following the proof in [19] for the directed information (or in [6] for the
entropy), it can be shown that

$$\lim_{k \to +\infty} \frac{1}{k} I(Dx^k \to y^k \| Dx_C^k) = \lim_{k \to +\infty} I(x^{k-1}; y(k) \mid y^{k-1}, x_C^{k-1})$$

$$\lim_{k \to +\infty} \frac{1}{k} I(x^k \to y^k \| Dx^k, x_C^k) = \lim_{k \to +\infty} I(x(k); y(k) \mid y^{k-1}, x^{k-1}, x_C^k)$$

Suppose now that we are dealing with finite order joint Markov sequences.
Then, by working with vectors, we can represent the signals using an order 1
multivariate Markov process. We thus assume that $(x, y, x_C)$ is a Markov
process of order 1. Under this assumption and stationarity, we have

$$\lim_{k \to +\infty} I(x^{k-1}; y(k) \mid y^{k-1}, x_C^{k-1}) = I(x(1); y(2) \mid y(1), x_C(1))$$

$$\lim_{k \to +\infty} I(x(k); y(k) \mid y^{k-1}, x^{k-1}, x_C^k) = I(x(2); y(2) \mid y(1), x(1), x_C^2)$$

and in this case, we can estimate the conditional transfer entropy and the
instantaneous information exchange from data.

Practically, from two time series x and y and a pool of others $x_C$, we
create from the signals the realizations of the vectors $x(1)_i = x_{i-d}$, $y(1)_i = y_{i-d}$,
$x_C(1)_i = x_{C,i-d}$ and $x_{C,i}^2 = (x_{C,i-d}, x_{C,i})$, and estimate $I(x(1); y(2) \mid y(1), x_C(1))$ and
$I(x(2); y(2) \mid y(1), x(1), x_C^2)$ using these realizations and the k-nn estimators
described above [20, 10]. This approach has already been described in [35]
for the transfer entropy.
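
The construction of these realizations amounts to aligning delayed and current copies of the signals. The sketch below shows one possible way to do it; the function name, the dictionary keys and the default delay d = 1 are our assumptions, and the resulting arrays are meant to be passed to whichever conditional mutual information estimator is used.

```python
import numpy as np

def lagged_realizations(x, y, xc, d=1):
    """Align past (index i - d) and present (index i) samples of x, y and
    of the side information xc (one column per side signal)."""
    x, y, xc = (np.asarray(s, dtype=float) for s in (x, y, xc))
    if xc.ndim == 1:
        xc = xc[:, None]
    return {
        "x1": x[:-d],    # x(1)_i   = x_{i-d}
        "y1": y[:-d],    # y(1)_i   = y_{i-d}
        "xc1": xc[:-d],  # x_C(1)_i = x_{C,i-d}
        "x2": x[d:],     # x(2)_i   = x_i
        "y2": y[d:],     # y(2)_i   = y_i
        "xc12": np.hstack([xc[:-d], xc[d:]]),  # past and present of x_C
    }

# Toy signals whose values encode their time index, to make the alignment visible.
t = np.arange(10)
r = lagged_realizations(t, 100 + t, 200 + t, d=1)
print(r["x1"][:3], r["x2"][:3], r["xc12"][:3])
```

With these arrays, the conditional transfer entropy would be estimated as I(x1; y2 | y1, xc1), and the instantaneous information exchange as I(x2; y2 | y1, x1, xc12), stacking the conditioning variables column-wise.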

5.2. Synthetic examples 

We develop here two synthetic examples to illustrate the key ideas
developed in the paper. In the first example, we stress the importance of causal
conditioning using a simple causality chain. The second example is a
particular instance of the example developed in section 3.2,
for which we estimate dynamical and instantaneous causality measures.

5.3. A chain 

Consider the following three dimensional example, in which the noises are
i.i.d. and independent of each other.

$$\begin{aligned}
x_t &= b x_{t-1} + \varepsilon_{x,t} \\
y_t &= c y_{t-1} + d_{xy} x_{t-1}^2 + \varepsilon_{y,t} \\
z_t &= d z_{t-1} + c_{yz} y_{t-1} + \varepsilon_{z,t}
\end{aligned}$$

where a = 0.2, b = 0.5, c = 0.8, $d_{xy}$ = 0.8, $c_{yz}$ = 0.7. Firstly, we evaluate
Geweke's measure based on linear prediction error [11, 12] (logarithm of the
ratio between variances of linear prediction errors). The measures are evaluated
on 100 independent realizations of length 3000 samples of the processes. They
are depicted in figure 2 in the form of histograms. As can be seen, the
histogram for the conditional Geweke measure $F_{x \to z \| y}$ has the same support as the
histogram of the unconditional measure $F_{x \to z}$. Therefore, we have an example
where linear Granger causality gives the same answer whether conditional or
not: x does not dynamically cause z (conditionally on y or not).
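
The behaviour of the linear measure on such a chain can be reproduced with ordinary least squares. The sketch below is our own illustration and not the authors' code: the autoregressive coefficients are assumed values (the mapping between the listed parameters and the equations is not fully recoverable here), two lags are used in every regression, the function names are hypothetical, and the squared signal is used as a regressor only to make the nonlinear x to y coupling visible to a linear tool.

```python
import numpy as np

def simulate_chain(n, rng, burn=200):
    # Assumed coefficients (illustrative): AR terms 0.5, 0.8, 0.5,
    # quadratic coupling x -> y of 0.8, linear coupling y -> z of 0.7.
    total = n + burn
    e = rng.standard_normal((total, 3))
    x, y, z = np.zeros(total), np.zeros(total), np.zeros(total)
    for t in range(1, total):
        x[t] = 0.5 * x[t - 1] + e[t, 0]
        y[t] = 0.8 * y[t - 1] + 0.8 * x[t - 1] ** 2 + e[t, 1]
        z[t] = 0.5 * z[t - 1] + 0.7 * y[t - 1] + e[t, 2]
    return x[burn:], y[burn:], z[burn:]

def lags(s, p):
    # Columns s[t-1], ..., s[t-p], aligned with the target s[p:].
    n = len(s)
    return np.column_stack([s[p - j:n - j] for j in range(1, p + 1)])

def residual_variance(target, blocks):
    design = np.column_stack([np.ones(len(target))] + blocks)
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ beta)

def geweke(src, dst, side=None, p=2):
    """log ratio of linear prediction error variances of dst, without and
    with the past of src, optionally given the past of side."""
    target = dst[p:]
    base = [lags(dst, p)] + ([lags(side, p)] if side is not None else [])
    restricted = residual_variance(target, base)
    full = residual_variance(target, base + [lags(src, p)])
    return np.log(restricted / full)

rng = np.random.default_rng(0)
x, y, z = simulate_chain(3000, rng)
print("F(x -> z)       =", geweke(x, z))
print("F(x -> z || y)  =", geweke(x, z, side=y))
print("F(x^2 -> z)     =", geweke(x ** 2, z))
print("F(x^2 -> z || y)=", geweke(x ** 2, z, side=y))
```

Since x is symmetric around zero, its past is linearly uncorrelated with its square, so the linear measure from x to z stays close to zero with or without y; the influence only shows up through the squared regressor, and disappears again once the past of y is included.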

We then evaluate the transfer entropy
$I(Dx \to z) = I(x_{t-2}^{t-1}; z_t \mid z_{t-2}^{t-1})$ and the conditional
transfer entropy
$I(Dx \to z \| Dy) = I(x_{t-2}^{t-1}; z_t \mid z_{t-2}^{t-1}, y_{t-2}^{t-1})$ on
the same data sets. The results are depicted in the bottom of figure (2).
We see that the histogram of the conditional measure is clearly centered
around zero, whereas the histogram of the unconditional measure has a clearly
non-overlapping support. Therefore, we conclude that when side information
is not taken into account, x causes z, whereas including y as side
information reverses the conclusion. Therefore, the existing link from x to z
passes through y. In the plot of the transfer entropy, we present the
histograms of the measures for three different values of k, the number of
nearest neighbors considered by the estimation. As seen and reported in [3],
there is a trade-off between bias and variance as a function of k. The present
lack of a precise theoretical analysis does not allow us to optimize this
trade-off in order to choose k (see however [34] for a work going in this
direction). However, numerical simulations have shown that k should be chosen
small when the dimension of the space increases.
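For concreteness, one common nearest-neighbour estimator of a conditional mutual information $I(a; b \mid c)$ is sketched below. It is an assumption that this resembles the estimator used here: the sketch follows the max-norm, k-th neighbour construction of Kraskov-type estimators, and the function name `knn_cmi` is hypothetical. It relies on SciPy's `cKDTree` and `digamma`.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_cmi(a, b, c, k=5):
    """k-nearest-neighbour estimate of I(a; b | c), in nats.

    a, b, c are arrays with one sample per row (1-D inputs are treated
    as a single column)."""
    a, b, c = (np.asarray(v, dtype=float).reshape(len(v), -1)
               for v in (a, b, c))
    joint = np.hstack([a, b, c])
    # Max-norm distance to the k-th neighbour in the joint space;
    # k + 1 because each query point is returned as its own neighbour.
    dist, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    # Shrink the radius by one floating point step to count points
    # strictly closer than the k-th neighbour.
    radius = np.nextafter(dist[:, -1], 0)

    def count(points):
        # Number of points within the radius, the point itself included.
        tree = cKDTree(points)
        return tree.query_ball_point(points, radius, p=np.inf,
                                     return_length=True)

    n_ac = count(np.hstack([a, c]))
    n_bc = count(np.hstack([b, c]))
    n_c = count(c)
    return digamma(k) + np.mean(digamma(n_c) - digamma(n_ac) - digamma(n_bc))
```

With this sketch, the conditional transfer entropy of the chain would be obtained by passing the two past samples of x as `a`, the present of z as `b`, and the stacked two past samples of z and y as `c`. Varying `k` exposes the bias and variance trade-off discussed above.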

5.4. A four dimensional complete toy



We come back to the example described in section 3.2. Below we provide
an explicit form to the functional links

$$
\begin{aligned}
w_t &= a w_{t-1} + \alpha z_{t-1} + e x_{t-1}^2 + \varepsilon_{w,t} \\
x_t &= b x_{t-1} + f z_{t-1}^2 + \varepsilon_{x,t} \\
y_t &= c y_{t-1} + \beta x_{t-1} + g x_{t-1}^2 + \varepsilon_{y,t} \\
z_t &= d z_{t-1} + \gamma w_{t-1} + \varepsilon_{z,t}
\end{aligned}
\qquad (7)
$$

and we recall that the noise sequence is white with covariance given by (1).
For the purpose of the example, we set $\rho_1 = 0.66$, $\rho_2 = 0.55$ and
$\rho_3 = 0.48$. To mimic a real experiment, we simulated one long time series
and used $N_b = 100$ consecutive blocks of 3000 samples each as the
realizations of the process. All the information measures needed were thus
evaluated on these blocks. Furthermore, we perform random permutations
to simulate the independence situation, called $H_0$. Precisely, when
estimating $I(a; b \mid c)$ from samples $a_i, b_i, c_i$, the permutation is
done on the $b_i$'s. Indeed, if b is independent from a and c, then
$I(a; b \mid c) = 0$. For example, when estimating the transfer entropy
$I(x_{t-2}^{t-1}; z_t \mid z_{t-2}^{t-1})$, we permute $z_t$ but not
$z_{t-2}^{t-1}$. Thus, for each block, two measures are actually computed: the
one that needs to be evaluated, and another one for which the $H_0$ hypothesis
is forced. The $N_b$ results under $H_0$ allow us to evaluate the threshold
$\eta_{ij}$ above which only $\alpha\%$ of false positive decisions (there is
a link from i to j) will be taken. In practice we set $\alpha = 10\%$. Since
12 pairwise dependence tests need to be made for this toy problem, the
Bonferroni correction¹ is applied to the threshold, in order to maintain the
family-wise false detection rate. 9 different measures were tested on this
example:

1. Geweke's instantaneous causality measure

$$
F_{x \cdot y} = \lim_{n \to +\infty}
\frac{\epsilon(x_n \mid x^{n-1}, y^{n-1})}{\epsilon(x_n \mid x^{n-1}, y^{n})}
$$

where $\epsilon(x \mid y)$ is the variance of the error in the linear
estimation of x from y.

2. Geweke's conditional instantaneous causality measure

$$
F_{x \cdot y \| w,z} = \lim_{n \to +\infty}
\frac{\epsilon(x_n \mid x^{n-1}, y^{n-1}, (w,z)^{n})}
     {\epsilon(x_n \mid x^{n-1}, y^{n}, (w,z)^{n})}
$$

3. Geweke's dynamical causality measure

$$
F_{y \to x} = \lim_{n \to +\infty}
\frac{\epsilon(x_n \mid x^{n-1})}{\epsilon(x_n \mid x^{n-1}, y^{n-1})}
$$

4. Geweke's conditional dynamical causality measure

$$
F_{y \to x \| w,z} = \lim_{n \to +\infty}
\frac{\epsilon(x_n \mid x^{n-1}, (w,z)^{n-1})}
     {\epsilon(x_n \mid x^{n-1}, y^{n-1}, (w,z)^{n-1})}
$$

5. Instantaneous information exchange $I(x^n \to y^n \| Dx^n)$.

6. Instantaneous unconditional information exchange
$I(x^n \to y^n \| Dx^n, Dw^n, Dz^n)$.

7. Instantaneous conditional information exchange
$I(x^n \to y^n \| Dx^n, w^n, z^n)$.

8. Transfer entropy $I(Dx^n \to y^n)$.

9. Conditional transfer entropy $I(Dx^n \to y^n \| Dw^n, Dz^n)$.

¹ I.e., $\alpha$ is replaced with $\alpha/12$ to ensure a family false positive
rate less than $\alpha\%$. Note that the Bonferroni correction is known to be
very conservative, and a less conservative procedure such as False Discovery
Rate control could easily be adopted.

Geweke's measures are based on linear estimation. Note that they take
values larger than one when a causal link is present and equal to one
otherwise. We stress that (up to a logarithm) Geweke's measures are the
Gaussian versions of the directed information measures discussed here [1],
[2]. Information measures were estimated using the appropriate conditional
mutual information definitions, with time lag windows of length 2; e.g., the
conditional transfer entropy $I(Dx^n \to y^n \| Dw^n, Dz^n)$ is approximated
by the estimation of
$I(x_{t-2}^{t-1}; y_t \mid y_{t-2}^{t-1}, w_{t-2}^{t-1}, z_{t-2}^{t-1})$. The
results are depicted in figure (3). The measures 1 to 9 are depicted from top
to bottom. The left column represents the matrix of the measures averaged over
the $N_b$ blocks. The right column represents the matrix of estimated
probabilities of deciding that there is a link between two nodes. To decide
that there is a link, we use the threshold $\eta_{ij}$ discussed above. To
evaluate Geweke's measures, we perform a linear prediction using 10 samples in
the past and evaluate the variance of the errors. Note that the diagonal of
all these matrices is arbitrarily set to zero, since the diagonal is not
informative in this study.
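The surrogate-based decision rule used to fill the right column can be sketched as follows. This is a minimal illustration, assuming the threshold is taken as an empirical quantile of the measures obtained under the forced independence hypothesis; the helper names `block_pair`, `bonferroni_threshold` and `detection_probability` are hypothetical, and `estimator` stands for any function returning an estimate of $I(a; b \mid c)$.

```python
import numpy as np

def block_pair(estimator, a, b, c, rng):
    """Measure on one block, and its surrogate obtained by permuting
    the samples of b only (a and c are left untouched)."""
    return estimator(a, b, c), estimator(a, rng.permutation(b), c)

def bonferroni_threshold(null_values, alpha=0.10, n_tests=12):
    """Empirical quantile of the surrogate measures at level
    1 - alpha / n_tests (Bonferroni-corrected level)."""
    return np.quantile(np.asarray(null_values), 1.0 - alpha / n_tests)

def detection_probability(values, threshold):
    """Fraction of blocks on which the measure exceeds the threshold."""
    return float(np.mean(np.asarray(values) > threshold))
```

With only 100 surrogate values per pair of nodes and a corrected level of 10%/12, the empirical quantile sits very close to the largest surrogate value, which is one reason why the resulting decisions are conservative.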

The main conclusions to be drawn from this experiment are the following.

• The linear analysis, whether causally conditional or not, implemented 
using Geweke's measures, fails to retrieve the structure of the causality 
graphs. 

• The instantaneous information exchange must be causally conditioned,
since the results given in the fifth line of figure (3) do not
reveal the exact nature of the dependencies.

• The importance of the horizon of causal conditioning appears in the 6th
and 7th lines, where we plot the results for the unconditional and
conditional instantaneous information exchange, respectively. The measures
are correctly estimated, since the probability of assigning links is very
high, as shown in the right column: the form of the matrices is correct.
We recover the form of the covariance matrix of the noise using the
unconditional form, whereas we recover the form of the precision matrix
using the conditional form of the instantaneous information exchange.
Note, in this example, the rather low probability of estimating the link
between x and y in the conditional form.

• The causality graph needs causal conditioning to be correctly inferred,
as revealed by the last two measures. However, note again the rather low
probability of correctly estimating the link from w to z, a difficulty
clearly due to the low coupling constant $\gamma$ existing in this direction.
Increasing this coupling to study the sensitivity is unfortunately
impossible, since even a slight increase of $\gamma$ destabilizes the system.

6. Conclusion 

In this paper, we have revisited and highlighted the links between directed 
information theory and Granger causality graphs. In the bivariate case, 
the directed information decomposes into the sum of two contributions: the 
transfer entropy and the instantaneous information exchange. Each term in 
this decomposition reveals a type of causality. Transfer entropy between two 
processes (say X and Y) is zero if and only if there is no dynamical Granger 
causality: the knowledge of the past of X does not lead to any improvement in
the prediction quality of Y. Instantaneous information exchange quantifies 
the instantaneous link that may exist between the two signals. 
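For reference, the bivariate decomposition recalled above can be written out in the notation used for the measures of section 5.4. The expansion into sums over the time index is not spelled out in this section; it is given here as a sketch that follows from the chain rule for conditional mutual information, with $D$ denoting the one-step delay.

```latex
\begin{aligned}
I(x^n \to y^n)
  &= \sum_{i=1}^{n} I(x^{i}; y_i \mid y^{i-1}) \\
  &= \sum_{i=1}^{n} I(x^{i-1}; y_i \mid y^{i-1})
   + \sum_{i=1}^{n} I(x_i; y_i \mid x^{i-1}, y^{i-1}) \\
  &= I(Dx^n \to y^n) + I(x^n \to y^n \,\|\, Dx^n)
\end{aligned}
```

The first sum is the transfer entropy, which uses only the past of x, and the second is the instantaneous information exchange, which measures what the present of x adds once both pasts are known.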

In the multivariate case however, instantaneous causality gives rise to in- 
creased difficulties when relating directed information theory to the measures 
introduced in the bivariate case. We have recalled that two definitions of in- 
stantaneous causality may be given, depending on the time horizon selected 
in the consideration of side information. If the past of the side information 
is considered, instantaneous causality leads to a concept of independence 
graph models, whereas consideration of the present of the side information 
as well leads to a conditional graphical model. Preferring one of these
definitions is a rather difficult choice, discussed in this paper: conditional
graphs enjoy nice Markov properties, whereas unconditional graphs represent
a preferred solution in neuroscience, as they provide a better match to the
concept of functional connectivity [33].

We have also shown that if independence graphs are considered, directed
information causally conditioned to the past of the side information
decomposes into the sum of the causally conditioned transfer entropy and the
causally conditioned (independent) information exchange, directly extending
the bivariate result. This decomposition, however, no longer holds
in the other case. For the conditional graph, an extra term appears in the
decomposition. It further explains how instantaneous exchange takes place
between the two signals of interest and the side information.

All this theoretical framework finds some practical developments, as
illustrated on two synthetic examples. The estimators we used in this paper
rely on nearest-neighbor-based entropy estimators. These estimators can be
used efficiently as long as the dimensionality of the problem at hand is not
high.

Figure 1: Causality graphs for the example developed in the text: difference
between the two definitions of instantaneous causality (panels: Conditional,
Unconditional).

Figure 2: Dynamical causality analysis from x to z in the first example. Top
(panel title: Geweke measures): linear analysis using Geweke's measures. Both
conditional and unconditional measures lead to the conclusion that
$x \nrightarrow z$. Bottom: directed information theoretic analysis. The three
different types of histograms correspond to three different choices of the
number of nearest neighbours k for the estimation. As can be seen, the
variance decreases with k but the bias increases. From the transfer entropy,
since $I(x \to z) > 0$, we obtain $x \to z$, whereas the conditional transfer
entropy leads to $x \nrightarrow z$ since $I(x \to z \| y) = 0$.

Figure 3: Measures calculated from example 2. From top to bottom:
instantaneous causality and conditional instantaneous causality Geweke's
measures, dynamical causality and conditional dynamical causality Geweke's
measures, instantaneous information exchange, unconditional instantaneous
information exchange, conditional instantaneous information exchange, transfer
entropy and finally conditional transfer entropy. The left column is the
mean measure calculated over 100 realizations of 3000 samples each. The right
column represents the number of times the corresponding measure exceeds a
threshold chosen to ensure a family false positive probability of 10 % (using
the Bonferroni correction).



